This report explores the Loan Data from Prosper which are proposed in the Data Set Options document. Through the exploration of the data, my objective is to understand how the Borrower Rate is determined for each Prosper Loan.
This dataset contain 81 variables concerning more than 114000 Prosper Loan.
## [1] 113937 81
## [1] "ListingKey"
## [2] "ListingNumber"
## [3] "ListingCreationDate"
## [4] "CreditGrade"
## [5] "Term"
## [6] "LoanStatus"
## [7] "ClosedDate"
## [8] "BorrowerAPR"
## [9] "BorrowerRate"
## [10] "LenderYield"
## [11] "EstimatedEffectiveYield"
## [12] "EstimatedLoss"
## [13] "EstimatedReturn"
## [14] "ProsperRating..numeric."
## [15] "ProsperRating..Alpha."
## [16] "ProsperScore"
## [17] "ListingCategory..numeric."
## [18] "BorrowerState"
## [19] "Occupation"
## [20] "EmploymentStatus"
## [21] "EmploymentStatusDuration"
## [22] "IsBorrowerHomeowner"
## [23] "CurrentlyInGroup"
## [24] "GroupKey"
## [25] "DateCreditPulled"
## [26] "CreditScoreRangeLower"
## [27] "CreditScoreRangeUpper"
## [28] "FirstRecordedCreditLine"
## [29] "CurrentCreditLines"
## [30] "OpenCreditLines"
## [31] "TotalCreditLinespast7years"
## [32] "OpenRevolvingAccounts"
## [33] "OpenRevolvingMonthlyPayment"
## [34] "InquiriesLast6Months"
## [35] "TotalInquiries"
## [36] "CurrentDelinquencies"
## [37] "AmountDelinquent"
## [38] "DelinquenciesLast7Years"
## [39] "PublicRecordsLast10Years"
## [40] "PublicRecordsLast12Months"
## [41] "RevolvingCreditBalance"
## [42] "BankcardUtilization"
## [43] "AvailableBankcardCredit"
## [44] "TotalTrades"
## [45] "TradesNeverDelinquent..percentage."
## [46] "TradesOpenedLast6Months"
## [47] "DebtToIncomeRatio"
## [48] "IncomeRange"
## [49] "IncomeVerifiable"
## [50] "StatedMonthlyIncome"
## [51] "LoanKey"
## [52] "TotalProsperLoans"
## [53] "TotalProsperPaymentsBilled"
## [54] "OnTimeProsperPayments"
## [55] "ProsperPaymentsLessThanOneMonthLate"
## [56] "ProsperPaymentsOneMonthPlusLate"
## [57] "ProsperPrincipalBorrowed"
## [58] "ProsperPrincipalOutstanding"
## [59] "ScorexChangeAtTimeOfListing"
## [60] "LoanCurrentDaysDelinquent"
## [61] "LoanFirstDefaultedCycleNumber"
## [62] "LoanMonthsSinceOrigination"
## [63] "LoanNumber"
## [64] "LoanOriginalAmount"
## [65] "LoanOriginationDate"
## [66] "LoanOriginationQuarter"
## [67] "MemberKey"
## [68] "MonthlyLoanPayment"
## [69] "LP_CustomerPayments"
## [70] "LP_CustomerPrincipalPayments"
## [71] "LP_InterestandFees"
## [72] "LP_ServiceFees"
## [73] "LP_CollectionFees"
## [74] "LP_GrossPrincipalLoss"
## [75] "LP_NetPrincipalLoss"
## [76] "LP_NonPrincipalRecoverypayments"
## [77] "PercentFunded"
## [78] "Recommendations"
## [79] "InvestmentFromFriendsCount"
## [80] "InvestmentFromFriendsAmount"
## [81] "Investors"
Let’s have a look to the variable of interest for our study: “BorrowerRate”. The Borrower rate goes from zero to almost 0.5. The mean value is around 0.2 and the distribution seems to have two modes: one around 0.15 and one around 0.25.
## BorrowerRate
## Min. :0.0000
## 1st Qu.:0.1340
## Median :0.1840
## Mean :0.1928
## 3rd Qu.:0.2500
## Max. :0.4975
The distribution of the estimated Loss has a mean value of 0.08 with which lower than the Borrower rate. The aim of the differences is to cover of the costs: paiement defaults, service fee, … It is likely that the Borrower rate is correlated to the the Loss Estimation.
## EstimatedLoss
## Min. :0.00490
## 1st Qu.:0.04240
## Median :0.07240
## Mean :0.08031
## 3rd Qu.:0.11200
## Max. :0.36600
The Prosper rating and the Prosper score are representing the level of risk of a loan. The Loss estimation is probably based on these scores. Most loan have a middle range rating / score.
## ProsperScore
## Min. : 1.00
## 1st Qu.: 4.00
## Median : 6.00
## Mean : 5.95
## 3rd Qu.: 8.00
## Max. :11.00
The Loan duration can have 3 differents values: 12, 36 and 60 months. Most of the loans are 36 and 60 months long.
The bar graph of the LonStatus variable shows that most of the Loans are still on-going, around 40000 are completed and that there is a relatively high number of loans that have been considered has chargedoff.
Almost half of the Borrower are owner of their house, and are working. From the distribution and the summary of the employment status duration it can be seen that half of the loans are made by people with less 67 months of seniority in their current job. It would have ben interesting to get the age of the borrower to compare with.
## EmploymentStatusDuration
## Min. : 0.00
## 1st Qu.: 26.00
## Median : 67.00
## Mean : 96.07
## 3rd Qu.:137.00
## Max. :755.00
The bar plots show that few loans are attributed to people earning less $25000.
## [1] "$0" "$1-24,999" "$100,000+" "$25,000-49,999"
## [5] "$50,000-74,999" "$75,000-99,999" "Not displayed" "Not employed"
The following distributions shows that out of the 114000 Listing, around 22000 have already had a Prosper Loan (20%)
The following distributions represent the number of On-time Prosper payments for all loans on-going and for the completed Loan only. The distribution are very similar. Picks are clearly visible for 1 and 36 months. There is almost no pick at 60 months (2 possibilities: no 60 months are finished or no 60 months loans ends up with 60 on-time payments). It can also be seen that even for “completed loan”, the number of on-time payement with spread through the complete range of month whereas there are not a lot of payments more than one month late. My hypothesis is that the a lot of loans are not kept up to the final payment.
The distribution shows that most loans have been made in 2013. The distribution of the amouts show some picks at $4000, $7000, $10000, $15000, $20000 and $25000. There are also small picks for $30000 and $35000. The main Monthly Loan Payment is around $200.
The main reasons provided for contracting a loan are Debt consolidation (ListingCategory equal to 1). Most Loans are made by people living in Californie.
At the time the credit is pulled around 30% of the people have had 1, 2 or 3 inquiries in the last 6 months. This result is coherent since the main reason for contracting the Prosper loan is a Debt Consolidation.
A 75% of the people contracting one Prosper Loan have more than 4 opened revolving accounts. The Median Monthly payment for the revolving account is $270.
## OpenRevolvingAccounts
## Min. : 0.00
## 1st Qu.: 4.00
## Median : 6.00
## Mean : 6.97
## 3rd Qu.: 9.00
## Max. :51.00
## OpenRevolvingMonthlyPayment
## Min. : 0.0
## 1st Qu.: 114.0
## Median : 271.0
## Mean : 398.3
## 3rd Qu.: 525.0
## Max. :14985.0
A large proportion of the Borrower have Delinquencies at the time of the Listing, which is explained by the fact the loan is made to consolidate their Debt.
A small proportion of the Prosper Loan Borrower have had a Public Record the 12 months preceeding the listing.
25% of the Borrower have their Bank card credit at 84% or more of usage and the mean bank card credit is $11200. The financial behaviour and status of the Borrower is probably correlated to the Rate that will be proposed.
## BankcardUtilization
## Min. :0.0000
## 1st Qu.:0.3100
## Median :0.6000
## Mean :0.5613
## 3rd Qu.:0.8400
## Max. :5.9500
## AvailableBankcardCredit
## Min. : 0
## 1st Qu.: 880
## Median : 4100
## Mean : 11210
## 3rd Qu.: 13180
## Max. :646285
A low Debt to Income Ratio show the possibility for a Borrower to pay every month for is loan, provided that it has an income range high enough.
The Occupation of the Borrower can’t bring information since a large number has an occupation flagged as “Other” (25%).
There are around 114 000 Prosper Loan listing. Each listing is composed of 81 features. Several set of features can be identified: * administrative information of each listing (key, number, creation date, grade, duration, status,…) * financial information on the loan (borrower rate, estimated losss,…) * Personal information on the borrower himself * Financial informationon the borrower * Prosper Loan information on the listing
The main feature of interest of this dataset in my opinion is the Borrower Rate. My objective is to determined how the rate is determined for each Loan. The sum borrowed, the duration and the borrower incomes should be taking part of the rate definition.
The financial information of the borrower should contribute to a kind of risk analysis that should ponderate the rate obtained by the borrower.
I created the variables LoanOriginationDate.month and LoanOriginationDate.month.bucket. The Loan interest are probably depending on the date when the Loan has been contracted.
Each time a sum in dollar was involved, I tried a log transformation of the axis in order to include the outliers in the vizualisation of the distribution (e.g. AvailableBankcardCredit, AmountDelinquent, OpenRevolvingMonthlyPayment). Concerning the IncomeRange variable I re-ordered the levels in order to improve the readability of the plot with increasing range of incomes.
From the Correlation computed, the borrower rate seems to be highly correlated is the variable EstimatedLoss. This should be normal since the this variable is a kind of risk estmation that probably integrates the various information gathered on the Borrower (current financial information and previous record if the borrower had already contracted a Prosper Loan). In addition, the borrowere rate is correlated with the AvailableBankcardCredit, the DebtToIncomeRatio, the StatedMonthlyIncome, the LoanOriginalAmount and the LoanOriginationDate.month.
The Borrower rate is decreasing with the increase of the income range.
As highlighted by the correlation, the borrower rate is correlated with the AvailableBankcardCredit. However the relation is relatively low and there is a big dispersion of the data.
The zero is the x axis correspond to jan 2005. The next plot shows the variation of the mean loan rate with the origination month of the loan.
The correlation of the Borrower rate with the EstimatedLoss is strong. However some additional parameters should enter into account to determine completely the borrower rate.
The loan original amount is correlated to the borrower rate. The higher the amount, the lower the rate.
For the lower value of the MonthlyLoanPayment (up to 750$), a decreasing trend can be observed.
The duration of the Loan has an influence on the borrower rate. The shorter the loan, the lower the rate.
The strong correlation has been seen on between the Borrower rate and the EstimatedLoss. The objective of the next 3 plots is to look for correlation between the EstimatedLoss and other variable.
The Estimated return is correlated with the EstimatedReturn. The variable definition is: > The estimated return assigned to the listing at the time it was created. Estimated return is the difference between the Estimated Effective Yield and the Estimated Loss Rate.
2 patterns can be identified in the plot: * horizontal lines * curves where the estimated loss increase with the estimated return
The same remarks can be done for the following plot between EstimatedLoss and EstimatedEffectiveYield.
The borrower rate is strongly correlated with the EstimatedLoss for the Loan. In addition, correlation between the borrower rate and the AvailableBankcardCredit, the StatedMonthlyIncome, the LoanOriginalAmount have been highlited with the 2-dimensionnal plots. In addition to these correlation a natural link between the borrower rate and the LoanOriginationDate.month variable has been observed.
Despite the week correation between the borrower rate and the duration of the loan, a boxplot has highlighted that the longer the duration the higher the mean borrower rate.
Yes, the various estimations are correlated between them (EstimatesEffectiveYield, EstimatedLoss, EstimatedReturn). I also see correlation between the EstimatedLoss and the same variables than the one correlated with the borrower rate (which is normal since borrower rate and estimated loss are strongly correlated)
The strongest relationship is between the borrower rate and the estimationLoss.
The following plot highlights a linear trends between the borrower rate and the AvailableBankcardCredit. In addition to this linear trend, a second dependency between the borrower rate and the IncomeRange at iso-AvailableBankcardCredit can be seen. However, the dispersion is relatively big around these trends.
This plot shows again the dependency between the borrower rate and the AvailableBankcardCredit. The colors are pretty well staged. It can also be seen that the range of borrower rates depends on when the Loan has been contracted.The relation between the Borrower rates and the AvailableBankcardCredit looks strong.
The next plot show again the relation between the borower rate and the Income of the Borrower.However the relation, even if visible with the line representing the evolution of the mean, is weaker than the one with the AvailableBankcardCredit.In addition, before 2009 (2005+55/12), the relation is less visible.
The conclusion for the next plot with the LoanOriginalAmount is the same as the previous one for the IncomeRange.
The next plot show that only 36 month loan up to 2011. 12 and 60 month loans started only from thi date. The data after 2011 show that in average the borrower rate for 12 months are much lower than for 36 and 60 months.
The next plot shows a dependency between the borrower rate and the DebtToIncome parameters. However again even if the relation is visible on the mean, the borrower rate are still very dispersed.
The next plot show the strongest relation up to now in this study between the borrower rate and the EstimatedLoss. It is expected that the riskier the Loan, the higher the borrower rate in order to protect the lender from bad finacial operation in case of deliquencies.
The next plot is similar to the previous one but with a split against the loan duration since it has been highlighted that shorter loan had lower rate in average. It can be seen that no EstimatedLoss data is available before 2009. Two additional remarks can be done: * Before 2011 only 36 months were made * The dependency of the Borrower rate to the Estimated Loss is less clear before 2011
Because of the 2 previous remarks, let’s plot the Borrower Rate by the Estimated Loss, the year the loan has been initiated and the duration of the loan. The depency between the Borrower Rate and the EstimatedLoss is almost linear for a given year and a given duration. For the 12 and 36 months loans, the linear dependency is remaining almost the same whatever the year the loan has been initiated. On the contrary, a modification of the dependency can be observed for the 60 months loan: for a given estimated Loss, a borrower got a lower rate in 2014 than in 2012.
In the previous plot, a clear behaviour has been indentified fot the loan after 2011. Let’s try to do the same work for the loan before 2011. From the information available in the description of the variable, I know that the EstimatedLoss variable is available since 2009. However, it doesn’t mean that it has been use from this date. The next plot shows that the way the Borrower Rate is determined is not the same before and after 2011 even if the EstimatedLoss has been made available in 2009.
It has been made clear that before 2011, a less rigourous process was used to define the Borrower Rate. In order to better understand the process, some correlations have been computed. The CreditScoreRange variable seems to be highly correlated with the BorrowerRate. In additional it seems that the AvailableBankcardCredit is also playing a role.
As highlighted with the correlation computation, the next multi-variate plot depictes the BorrowerRate by the AvailableBankcardCredit and the CreditScoreRangeLower.
The final plot for this part is there to highlight the main parameters that are used to evaluate the EstimatedLoss. DebtToIncomeRatio, BankcardUtilization and AvailableBankcardCredit looks to enter into account to compute the EstimatedLoss but the remaining dispersion shows that other parameters are also taking part to the formula.
My variable of interest was the Borrower rate. Throuh the Multivariate analysis, I found a strong realtionship between the borrower rate, when the Loan has been contracted, the EstimatedLoss parameter which evaluate the risk of the Loan and the duration of the Loan.
I also tried to find how the EstimatedLoss parameter was correlated to the other parameters. I didn’t succeed finding a clear relation between the EstimatedLoss and a few of the other parameters. Each time, a large dispersion was remaining even if I managed to highlight some correlations (e.g. with DebtToIncomeRatio, BankcardUtilization and AvailableBankcardCredit). I come to the conclusion that this parameter is probably the result of a complex calcul based on a large number of fiancial information on the borrower in order to evaluate the risk.
Once the EstimatedLoss is determined, the borrower rate look determined by a direct relationship depending on the current global cost of the money at the time the loan is made and the duration of the loan. Maybe here what interpelled me is the cultural differences for the borrower rate definition: in France, a big part of the rate you can get for a loan is linked to your negociation skills whereas in this case it looks pretty mathematical.
No, I didn’t create any models because I come to the conclusion that the Borrower rate what probably depending on the date when the loan is made. As this basic money rate has not a regular variation through time, it doesn’t look easy to create a linear model of the Borrower rate.
However I imagine that if I remove the seasonal variation of the loan rate from the data, I suppose the difference of the borrower rate with the seasonal variation would not any more depend on when the loan has been made. Then, it would probably possible to create linear model providing the difference of the borrower rate with the current global money cost as a function of the EstimatedLoss and the Loan duration.
The distribution of Borrower rate appears bimodal: one mode around 0.15 and one mode around 0.25. In addition, the distribution highlights some picks at some specific values (0.2, 0.32, 0.35).
Boxplot of the Borrower Rate by income range. The higher the income range, the lower the Borrower Rate median value.The Inter-Quartile Range seems to be stable for the income range higher than $25000. For the income range $1-24999, the Inter-Quartile looks bigger.
The plot shows that a direct relation between the borrower rate after 2011 and the Estimated loss exists. The relation is almost linear whatever the duration of the Loan. The relation between the Borrower Rate and the Estimated Loss remains almost the same through the years for the 12 and 36 months loans. For the 60 months loans, the relation evolve through the years. For a given Estimated Loss, a Borrower got a lower rate in 2014 than in 2012. The company responsible for the Prosper Loan has probaby made evolve the relation so that its product remain competitive with regards to the competitors through the years.
In addition, the split on the Loan duration shows that the Borrower rate relation with the Estimated Loss is also depending on the duration of the loan.
The Prosper Loan dataset contains around 114000 Loan Listing information. depending on when the loan has been contracted some information where missing. The study made on this dataset highlight that the Borrower rate proposed are highly correlated to the Estimated Loss parameter and the Loan duration. In addition since the money cost evolves with the international context, the average Borrower rate varies through time.
Even if I manage to explain how the Borrower Rate was defined based on the Estimated Loss, the date of creation and the duration of the loan, I am not fully satisfied because, I didn’t succeed in finding a good explaination of how the Estimated Loss was determined. I highlighted several correlations but I supposed that the way the Estimated Loss is computed takes into account to much variables with too complex formula to be reverse engineer simply with correlation analysis.
Concerning the time dependency of the borrower rate, I would supposed that it should be possible to find data providing the global rate of the money cost in order to remove the time variation of the Borrower rate of the Loan data. This way, it may have been possible to construct a model between the borrower rate, the estimated loss and the duration.